Summary

In the previous document, cdi_schubert.Rmd, we explored two microbiome datasets for FDR benchmarking. Here we use the new SummarizedBenchmark package to perform the benchmarking in a standardized way.

Workspace setup

# project directory & data/results folders
setwd("/Users/claire/github/benchmark-fdr/datasets/microbiome/")
datdir <- "/Users/claire/github/benchmark-fdr/datasets/microbiome/DATA/"
resdir <- "/Users/claire/github/benchmark-fdr/datasets/microbiome/RESULTS/"
  
# wrangling & plotting tools
library(data.table)
library(dplyr)
library(tidyr)     # separate(), spread()
library(reshape2)  # melt()
library(ggplot2)
library(magrittr)
library(R.utils)
library(cowplot)

# benchmark methods
library(IHW)
library(ashr)
library(qvalue)
library(swfdr)
library(fdrtool)
library(FDRreg)

# comparison tools/functions
library(SummarizedBenchmark)  
sourceDirectory("/Users/claire/github/benchmark-fdr/datasets/R/")

## set up parallel backend 
library(BiocParallel)
cores <- as.numeric(Sys.getenv("SLURM_NTASKS"))
if (is.na(cores)) cores <- 1  # SLURM_NTASKS is unset when running locally
multicoreParam <- MulticoreParam(workers = cores)

Note that outside of SLURM, SLURM_NTASKS is unset, so we fall back to a single worker. The parallel backend is not actually used below; the parallel arguments to buildBench are commented out.

CDI Schubert - diarrhea dataset

Here we download and analyze the CDI Schubert diarrhea dataset (C. diff and non-C. diff diarrhea). We’ll download the processed OTU tables from Zenodo and unzip them in the benchmark-fdr/datasets/microbiome/DATA folder.

Data download

cd /Users/claire/github/benchmark-fdr/datasets/microbiome/DATA/
curl -O "https://zenodo.org/record/840333/files/cdi_schubert_results.tar.gz"
tar -xzvf cdi_schubert_results.tar.gz
cd ..

Next, we’ll read in the unzipped OTU table and metadata files into R.

We’ll need to manually define which disease labels are of interest and which metadata column contains the sample IDs (i.e. the IDs used in the OTU table).

otu_path <- paste0(datdir, "cdi_schubert_results/RDP/cdi_schubert.otu_table.100.denovo.rdp_assigned")
meta_path <- paste0(datdir, "cdi_schubert_results/cdi_schubert.metadata.txt")

# DiseaseState labels to keep
labels <- c("H", "nonCDI", "CDI")

# Metadata column that has sample IDs
col_label <- "sample_id"

# load OTU table and metadata
otu <- read.table(otu_path)
meta <- read.csv(meta_path, sep='\t')

# Keep only samples with the right DiseaseState metadata
meta <- meta %>% filter(DiseaseState %in% labels)

# Keep only samples with both metadata and 16S data
keep_samples <- intersect(colnames(otu), meta[[col_label]])
otu <- otu[, keep_samples]
meta <- meta %>% filter(get(col_label) %in% keep_samples)

Since we’ll be using OTU-wise covariates, we shouldn’t need to perform any filtering/cleaning of the OTUs beyond removing any that are all zeros (which can happen after removing shallow samples). However, we do remove any samples with too few reads; the minimum number of reads per sample is defined in sample_reads.

After we’ve removed any shallow samples, we’ll convert the OTU table to relative abundances. A pseudo-count could be added to zero entries to avoid problems with taking logs, but zeroabun is set to 0 below, so no pseudo-count is actually added.

remove_shallow_smpls <- function(df, n_reads) {
  # Removes samples (columns) with fewer than n_reads total reads.
  # df has OTUs in rows and samples in columns.
  return(df[, colSums(df) > n_reads])
}

remove_shallow_otus <- function(df, n_reads){
  # Removes OTUs (rows) with fewer than n_reads total reads across samples.
  return(df[rowSums(df) > n_reads, ])
}

## Remove OTUs with fewer than 10 reads
otu <- remove_shallow_otus(otu, 10)

# Remove samples with fewer than sample_reads reads
sample_reads <- 100
otu <- remove_shallow_smpls(otu, sample_reads)

# Update metadata with new samples
meta <- meta %>% filter(get(col_label) %in% colnames(otu))

# Remove empty OTUs
otu <- otu[rowSums(otu) > 0, ]

# Convert to relative abundance (divide each sample/column by its total)
abun_otu <- t(t(otu) / rowSums(t(otu)))

# Add pseudo counts (currently disabled: zeroabun is 0)
#minabun <- min(abun_otu[abun_otu > 0]) # Use this to choose a pseudo-count value
zeroabun <- 0
abun_otu <- abun_otu + zeroabun
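
As a quick sanity check (an optional addition, not part of the original pipeline), each sample’s relative abundances should now sum to 1:

# Optional sanity check: with zeroabun = 0, each sample (column) sums to 1
stopifnot(all(abs(colSums(abun_otu) - 1) < 1e-8))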

Next, we need to calculate the p-value, effect size, and standard error for each OTU. Here, we’ll compare diarrhea vs. healthy, where diarrhea includes both CDI and nonCDI patients. We’ll put these results into a dataframe and label the columns with the standardized names for downstream use (pval, SE, effect_size, test_statistic). The test statistic is the one returned by wilcox.test().

Note that the effect here is calculated as the log-fold change of mean abundance in controls relative to cases (i.e. log(mean_abun[controls]/mean_abun[cases])).

While we’re at it, we’ll also calculate the mean abundance and ubiquity (detection rate) of each OTU. Later, we can assign their values to a new column called ind_covariate for use in downstream steps.

resfile <- paste0(resdir, "schubert_results_",
                  nrow(abun_otu), "_OTUs.RData")
if (!file.exists(resfile)){
  # Get case and control indices
  case_idx <- meta$DiseaseState %in% c("CDI", "nonCDI")
  ctrl_idx <- meta$DiseaseState %in% c("H")
  
  # Calculate pvalues, effects, and stderr
  pvals <- c()
  teststats <- c()
  ses <- c()
  effs <- c()
  mean_abuns <- c()
  mean_abuns_present <- c()
  ubis <- c()
  
  casedf <- abun_otu[, case_idx]
  ctrldf <- abun_otu[, ctrl_idx]
  for (o in rownames(abun_otu)) {
    # wilcoxon p value  
    w <- wilcox.test(casedf[o, ], 
                     ctrldf[o, ])
    p <- w$p.value
    teststat <- w$statistic
    
    pvals <- c(pvals, p)
    teststats <- c(teststats, teststat)
    
    # standard error of the OTU abundance, across all samples
    ses <- c(ses, 
             sd(abun_otu[o, ])/sqrt(length(abun_otu[o, ]))
             )
    
    # mean OTU abundance across all samples (after removing pseudo-count)
    mean_abuns <- c(mean_abuns, 
                    mean(abun_otu[o, ] - zeroabun)
                    )
    
    # mean OTU abundance across only samples with the OTU present
    mean_abuns_present <- c(mean_abuns_present, 
                            sum(abun_otu[o, ] - zeroabun) / sum(abun_otu[o, ] > zeroabun)
                            )
    
    # ubiquity of OTU across all samples
    ubis <- c(ubis, 
              sum(abun_otu[o, ] > zeroabun) / length(abun_otu[o, ])
              )
    
    # effect (logfold difference)
    effs <- c(effs, 
              log(mean(abun_otu[o, ctrl_idx])/mean(abun_otu[o, case_idx]))
              )
  
  }
  
  res <- data.frame(otu = rownames(abun_otu), 
                    pval = pvals, wilcox_teststat = teststats,
                    SE = ses, effect_size = effs, 
                    mean_abun = mean_abuns, mean_abun_present = mean_abuns_present,
                    ubiquity = ubis)
  res <- res %>%
    # signed z-score from the two-sided p-value; p/2 is computed in log space
    mutate(test_statistic = qnorm(exp(log(pval)-log(2)), lower.tail=FALSE) * sign(effect_size))

  save(res, file=resfile)
  
}else{
  load(resfile)
}
head(res)
##                                                                                                                       otu
## 1 k__Bacteria;p__Bacteroidetes;c__Bacteroidia;o__Bacteroidales;f__Porphyromonadaceae;g__Parabacteroides;s__;d__denovo1106
## 2         k__Bacteria;p__Bacteroidetes;c__Bacteroidia;o__Bacteroidales;f__Bacteroidaceae;g__Bacteroides;s__;d__denovo3059
## 3         k__Bacteria;p__Bacteroidetes;c__Bacteroidia;o__Bacteroidales;f__Bacteroidaceae;g__Bacteroides;s__;d__denovo3058
## 4         k__Bacteria;p__Bacteroidetes;c__Bacteroidia;o__Bacteroidales;f__Bacteroidaceae;g__Bacteroides;s__;d__denovo3051
## 5                k__Bacteria;p__Firmicutes;c__Clostridia;o__Clostridiales;f__Lachnospiraceae;g__Blautia;s__;d__denovo3050
## 6         k__Bacteria;p__Bacteroidetes;c__Bacteroidia;o__Bacteroidales;f__Bacteroidaceae;g__Bacteroides;s__;d__denovo3053
##           pval wilcox_teststat           SE effect_size    mean_abun
## 1 0.0007407769         12489.0 5.741686e-06   0.6398794 2.706130e-05
## 2 0.0012890518         13095.5 2.648649e-06   1.5905094 7.890719e-06
## 3 0.0234969371         13342.0 2.192590e-06   1.1674276 7.703083e-06
## 4 0.1413667404         13594.0 2.338025e-06   0.7680453 7.735577e-06
## 5 0.2279200700         14297.0 5.895746e-06  -1.7274118 1.221058e-05
## 6 0.0409601397         13430.5 2.570857e-06   0.7198081 8.241867e-06
##   mean_abun_present   ubiquity test_statistic
## 1      0.0002841437 0.09523810       3.374025
## 2      0.0002209401 0.03571429       3.218406
## 3      0.0001990951 0.03869048       2.265257
## 4      0.0002165962 0.03571429       1.470720
## 5      0.0005128445 0.02380952      -1.205734
## 6      0.0002307723 0.03571429       2.043933
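
As an aside, the test_statistic column above is a signed z-score derived from the two-sided Wilcoxon p-value; the mutate() call halves the p-value in log space, presumably to stay accurate for very small p-values. An illustrative equivalence check:

# qnorm on p/2 directly vs. via exp(log(p) - log(2)), as in the mutate() call
p <- 0.0007407769
all.equal(qnorm(p / 2, lower.tail = FALSE),
          qnorm(exp(log(p) - log(2)), lower.tail = FALSE))
## [1] TRUE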

Finally, let’s try to add phylogeny as covariates. Here we’ll have columns for each separate taxonomic level.

res <- res %>% separate(otu, 
                        c("kingdom", "phylum", "class", "order", "family", "genus", "species", "denovo"), 
                        sep=";", remove = FALSE)

Check Covariate Diagnostics

Here we look to see if the covariates do indeed look informative.

Ubiquity
strat_hist(res, pvalue="pval", covariate="ubiquity", maxy=20, binwidth=0.05)

rank_scatter(res, pvalue="pval", covariate="ubiquity")

Mean abundance (across non-zero samples)
strat_hist(res, pvalue="pval", covariate="mean_abun_present", maxy=8, binwidth=0.05)

rank_scatter(res, pvalue="pval", covariate="mean_abun_present")

Phylogeny

Let’s look at phylum-level stratification first. A priori, I might expect Proteobacteria to be enriched for low p-values, but that expectation isn’t well grounded, and Eric doesn’t think phylogeny will be informative at all.

#strat_hist(res, pvalue="pval", covariate="mean_abun_present", maxy=20)
#rank_scatter(res, pvalue="pval", covariate="ubiquity")
ggplot(res, aes(x=pval)) + geom_histogram() + facet_wrap(~phylum, scales = "free")

Let’s use ubiquity as our ind_covariate

res <- res %>% mutate(ind_covariate = ubiquity)

Also (hacky), set test_statistic and effect_size to NA so that ash doesn’t run.

res <- res %>% 
  mutate(test_statistic = NA) %>%
  mutate(effect_size = NA)
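
For context, ash fits its model to effect estimates and their standard errors, which is why setting those columns to NA keeps it from running. A minimal sketch of that interface on synthetic data (the actual wiring into res lives in the sourced helper scripts):

# ash() shrinks effect estimates (betahat) given standard errors (sebetahat)
library(ashr)
demo_fit <- ash(betahat = rnorm(100), sebetahat = rep(1, 100))
head(get_qvalue(demo_fit))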

Set up BenchDesign object

First, we’ll create an object of the BenchDesign class to hold the data and add the benchmark methods to the BenchDesign object. We’ll do this for a few values of nmids to explore its effect on the outputs.

Then, we’ll construct the SummarizedBenchmark object, which will run the functions specified in each method (these are actually sourced in from the helper scripts).
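
initializeBenchDesign() is defined in the sourced helper scripts. For orientation, the underlying SummarizedBenchmark workflow looks roughly like this minimal sketch with two stand-in methods (written against the current SummarizedBenchmark API, not the project’s actual method set):

# Build a design with two simple methods, then evaluate it on a data.frame;
# each method's params refer to columns of the data via quosures
bd_demo <- BenchDesign() %>%
  addMethod(label = "bh", func = p.adjust,
            params = rlang::quos(p = pval, method = "BH")) %>%
  addMethod(label = "bonf", func = p.adjust,
            params = rlang::quos(p = pval, method = "bonferroni"))
sb_demo <- buildBench(bd_demo, data = data.frame(pval = runif(100)))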

for (nmids in c(5, 50, 150)){
  print(nmids)
  bd <- initializeBenchDesign(nmids=nmids)
  
  resfile <- paste0(resdir, "schubert_summarizedBenchmark_",
                    nrow(res), "_nmids_", nmids, ".RData")
  duration <- NA
  if (!file.exists(resfile)){
    t1 <- proc.time()
    sb <- bd %>% buildBench(data=res, ftCols = "ind_covariate")
                            #parallel=TRUE, BPPARAM=multicoreParam)
    metadata(sb)$data_download_link <- "https://zenodo.org/record/840333/files/cdi_schubert_results.tar.gz"
    save(sb, file=resfile)
    duration <- round((proc.time()-t1)[3]/60,1)
  }else{
    load(resfile)
  }
}

Just for future runs, where I load the file instead of re-calculating it, here are the errors I got:

[1] 5
Censored sample for null model estimation has only size 0 !Censored sample for null model estimation has only size 0 !Censored sample for null model estimation has only size 0 !Censored sample for null model estimation has only size 0 !NaNs produced<simpleError in if (sum(D^2) > qchisq(0.9, nmids - 2 - df)) {    warning(paste0("f(z) misfit = ", round(D, 1), ".  Rerun with increased df."))}: missing value where TRUE/FALSE needed>

[1] 50
Censored sample for null model estimation has only size 0 !Censored sample for null model estimation has only size 0 !Censored sample for null model estimation has only size 0 !Censored sample for null model estimation has only size 0 !f(z) misfit = -21.6.  Rerun with increased df.f(z) misfit = -22.5.  Rerun with increased df.f(z) misfit = -10.  Rerun with increased df.f(z) misfit = 32.  Rerun with increased df.f(z) misfit = 14.  Rerun with increased df.f(z) misfit = -32.2.  Rerun with increased df.f(z) misfit = 6.9.  Rerun with increased df.f(z) misfit = -1.3.  Rerun with increased df.f(z) misfit = 0.7.  Rerun with increased df.f(z) misfit = 3.4.  Rerun with increased df.f(z) misfit = 5.5.  Rerun with increased df.f(z) misfit = 0.1.  Rerun with increased df.f(z) misfit = -0.9.  Rerun with increased df.f(z) misfit = -1.5.  Rerun with increased df.f(z) misfit = -1.9.  Rerun with increased df.f(z) misfit = -0.8.  Rerun with increased df.f(z) misfit = -2.4.  Rerun with increased df.f(z) misfit = 1.7.  Rerun with increased df.f(z) misfit = 0.5.  Rerun with increased df.f(z) misfit = 2.  Rerun with increased df.f(z) misfit = 0.1.  Rerun with increased df.f(z) misfit = 1.4.  Rerun with increased df.f(z) misfit = 0.2.  Reru... <truncated>

[1] 150
Censored sample for null model estimation has only size 0 !Censored sample for null model estimation has only size 0 !Censored sample for null model estimation has only size 0 !Censored sample for null model estimation has only size 0 !f(z) misfit = 83.4.  Rerun with increased df.f(z) misfit = -8.5.  Rerun with increased df.f(z) misfit = -13.6.  Rerun with increased df.f(z) misfit = -17.1.  Rerun with increased df.f(z) misfit = -6.4.  Rerun with increased df.f(z) misfit = -19.5.  Rerun with increased df.f(z) misfit = -17.8.  Rerun with increased df.f(z) misfit = -3.  Rerun with increased df.f(z) misfit = -21.1.  Rerun with increased df.f(z) misfit = 28.6.  Rerun with increased df.f(z) misfit = -26.  Rerun with increased df.f(z) misfit = 59.3.  Rerun with increased df.f(z) misfit = -17.1.  Rerun with increased df.f(z) misfit = 12.1.  Rerun with increased df.f(z) misfit = -28.3.  Rerun with increased df.f(z) misfit = -27.6.  Rerun with increased df.f(z) misfit = 83.7.  Rerun with increased df.f(z) misfit = -10.5.  Rerun with increased df.f(z) misfit = -22.6.  Rerun with increased df.f(z) misfit = -23.3.  Rerun with increased df.f(z) misfit = 37.3.  Rerun with increased df.f(z) misfit = -18.  Rerun with increased df.f(z... <truncated>>

Next, we’ll add the default performance metric for q-value assays. First, we have to rename the assay to ‘qvalue’.

# rename assay to qvalue
assayNames(sb) <- "qvalue"
sb <- addDefaultMetrics(sb)

Now, we’ll plot the results.

Debugging note: to look at the matrix of qvalues, run assays(sb)[["qvalue"]]
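
For a quick numeric companion to these plots (a small sketch; the qvalue assay is a tests × methods matrix):

# Number of rejections per method at alpha = 0.1, ignoring NA q-values
colSums(assays(sb)[["qvalue"]] <= 0.1, na.rm = TRUE)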

# plot nrejects by method overall and stratified by covariate
rejections_scatter( sb, as_fraction=FALSE, supplementary=FALSE)
## Warning: Removed 40 rows containing missing values (geom_point).

rejection_scatter_bins(sb, covariate="ind_covariate", supplementary=FALSE)
## Warning: Removed 160 rows containing missing values (geom_point).

plotFDRMethodsOverlap(sb, alpha=0.1, supplementary=FALSE, order.by="freq", nsets=100 )

Hm, now the code runs. However, there are clearly still some issues:

- ashs rejects all hypotheses (all q-values are essentially 0).
- lfdr and scott-empirical are all NaN (likely related to the df error above).

methods <- c( "lfdr", "ihw-a10", "bl-df03", "qvalue", "bh", "bonf" )
plotCovariateBoxplots( sb, alpha=0.1, nsets=6, methods=methods)

assays(sb)[["qvalue"]]["ashs"] %>% max()
sum(is.na(assays(sb)[["qvalue"]]["lfdr"]))
sum(is.na(assays(sb)[["qvalue"]]["scott-empirical"]))
sum(is.na(assays(sb)[["qvalue"]]["scott-theoretical"]))

Plotting methods are giving errors for some reason. Let’s try to use Alejandro’s code instead.

And we’ll clean up the workspace before moving on to the next dataset.

rm(res)
rm(bd)
rm(sb)

OB Goodrich - obesity dataset

Let’s repeat these analyses for the OB Goodrich dataset.

Here we download and analyze the OB Goodrich obesity dataset. We’ll download the processed OTU tables from Zenodo and unzip them in the benchmark-fdr/datasets/microbiome/DATA folder.

Data download

cd /Users/claire/github/benchmark-fdr/datasets/microbiome/DATA/
curl -O "https://zenodo.org/record/840333/files/ob_goodrich_results.tar.gz"
tar -xzvf ob_goodrich_results.tar.gz
cd ..

Next, we’ll read in the unzipped OTU table and metadata files into R.

We’ll need to manually define which disease labels are of interest and which metadata column contains the sample IDs (i.e. the IDs used in the OTU table).

otu_path <- paste0(datdir, "ob_goodrich_results/RDP/ob_goodrich.otu_table.100.denovo.rdp_assigned")
meta_path <- paste0(datdir, "ob_goodrich_results/ob_goodrich.metadata.txt")

# DiseaseState labels to keep
labels <- c("H", "OB")

# Metadata column that has sample IDs
col_label <- "X"

# load OTU table and metadata
otu <- read.table(otu_path)
meta <- read.csv(meta_path, sep='\t')

# Keep only samples with the right DiseaseState metadata
meta <- meta %>% filter(DiseaseState %in% labels)

# Keep only samples with both metadata and 16S data
keep_samples <- intersect(colnames(otu), meta[[col_label]])
otu <- otu[, keep_samples]
meta <- meta %>% filter(get(col_label) %in% keep_samples)

Since we’ll be using OTU-wise covariates, we shouldn’t need to perform any filtering/cleaning of the OTUs beyond removing any that are all zeros (which can happen after removing shallow samples). However, we do remove any samples with too few reads; the minimum number of reads per sample is defined in sample_reads.

After we’ve removed any shallow samples, we’ll convert the OTU table to relative abundances. A pseudo-count could be added to zero entries to avoid problems with taking logs, but zeroabun is set to 0 below, so no pseudo-count is actually added.

remove_shallow_smpls <- function(df, n_reads) {
  # Removes samples (columns) with fewer than n_reads total reads.
  # df has OTUs in rows and samples in columns.
  return(df[, colSums(df) > n_reads])
}

remove_shallow_otus <- function(df, n_reads){
  # Removes OTUs (rows) with fewer than n_reads total reads across samples.
  return(df[rowSums(df) > n_reads, ])
}

remove_rare_otus <- function(df, perc_samples){
  # Removes OTUs (rows) present in at most perc_samples (fraction) of samples.
  return(df[rowSums(df > 0) / dim(df)[2] > perc_samples, ])
}

## Remove OTUs with fewer than 10 reads
otu <- remove_shallow_otus(otu, 10)
print(dim(otu))
## [1] 46846   644
## Remove OTUs in fewer than 1% of samples
otu <- remove_rare_otus(otu, 0.01)
print(dim(otu))
## [1] 42388   644
# Remove samples with fewer than sample_reads reads
sample_reads <- 100
otu <- remove_shallow_smpls(otu, sample_reads)

# Update metadata with new samples
meta <- meta %>% filter(get(col_label) %in% colnames(otu))

# Remove empty OTUs
otu <- otu[rowSums(otu) > 0, ]

# Convert to relative abundance
abun_otu <- t(t(otu) / rowSums(t(otu)))

# Add pseudo counts (currently disabled: zeroabun is 0)
#minabun <- min(abun_otu[abun_otu > 0]) # Use this to figure out what pseudo-count value to add
zeroabun <- 0
abun_otu <- abun_otu + zeroabun

Next, we need to calculate the p-value, effect size, and standard error for each OTU. Here, we’ll compare lean vs. obese. We’ll put these results into a dataframe and label the columns with the standardized names for downstream use (pval, SE, effect_size, test_statistic). The test statistic is the one returned by wilcox.test().

Note that the effect here is calculated as the log-fold change of mean abundance in controls relative to cases (i.e. log(mean_abun[controls]/mean_abun[cases])).

While we’re at it, we’ll also calculate the mean abundance and ubiquity (detection rate) of each OTU. Later, we can assign their values to a new column called ind_covariate for use in downstream steps.

Note that OB Goodrich has ~70,000 OTUs (42,388 after the filtering above), so this loop is going to take a while. One faster, vectorized way to write it is sketched below.
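
As a hypothetical refactor, the per-OTU summaries can be vectorized; only the Wilcoxon tests still run row by row, via apply (this assumes the rows of meta align with the columns of abun_otu, as the loop below does):

# Vectorized per-OTU summaries (sketch); case_idx/ctrl_idx as defined below
n_samples  <- ncol(abun_otu)
pvals      <- apply(abun_otu, 1, function(x)
                wilcox.test(x[case_idx], x[ctrl_idx])$p.value)
ses        <- apply(abun_otu, 1, sd) / sqrt(n_samples)
mean_abuns <- rowMeans(abun_otu - zeroabun)
ubis       <- rowSums(abun_otu > zeroabun) / n_samples
effs       <- log(rowMeans(abun_otu[, ctrl_idx]) / rowMeans(abun_otu[, case_idx]))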

dim(abun_otu)
## [1] 42388   644
resfile <- paste0(resdir, "goodrich_results_",
                  nrow(abun_otu), "_OTUs.RData")
if (!file.exists(resfile)){
  # Get case and control indices
  case_idx <- meta$DiseaseState %in% c("OB")
  ctrl_idx <- meta$DiseaseState %in% c("H")
  
  # Calculate pvalues, effects, and stderr
  pvals <- c()
  teststats <- c()
  ses <- c()
  effs <- c()
  mean_abuns <- c()
  mean_abuns_present <- c()
  ubis <- c()
  
  casedf <- abun_otu[, case_idx]
  ctrldf <- abun_otu[, ctrl_idx]
  for (o in rownames(abun_otu)) {
    # wilcoxon p value  
    w <- wilcox.test(casedf[o, ], 
                     ctrldf[o, ])
    p <- w$p.value
    teststat <- w$statistic
    
    pvals <- c(pvals, p)
    teststats <- c(teststats, teststat)
    
    # standard error of the OTU abundance, across all samples
    ses <- c(ses, 
             sd(abun_otu[o, ])/sqrt(length(abun_otu[o, ]))
             )
    
    # mean OTU abundance across all samples (after removing pseudo-count)
    mean_abuns <- c(mean_abuns, 
                    mean(abun_otu[o, ] - zeroabun)
                    )
    
    # mean OTU abundance across only samples with the OTU present
    mean_abuns_present <- c(mean_abuns_present, 
                            sum(abun_otu[o, ] - zeroabun) / sum(abun_otu[o, ] > zeroabun)
                            )
    
    # ubiquity of OTU across all samples
    ubis <- c(ubis, 
              sum(abun_otu[o, ] > zeroabun) / length(abun_otu[o, ])
              )
    
    # effect (logfold difference)
    effs <- c(effs, 
              log(mean(abun_otu[o, ctrl_idx])/mean(abun_otu[o, case_idx]))
              )
  
  }
  
  res <- data.frame(otu = rownames(abun_otu), 
                    pval = pvals, test_statistic = teststats,
                    SE = ses, effect_size = effs, 
                    mean_abun = mean_abuns, mean_abun_present = mean_abuns_present,
                    ubiquity = ubis)
  res <- res %>%
    mutate(test_statistic = qnorm(exp(log(pval)-log(2)), lower.tail=FALSE) * sign(effect_size))
  
  save(res, file=resfile)
  
}else{
  load(resfile)
}
head(res)
##                                                                                                                                         otu
## 1              k__Bacteria;p__Firmicutes;c__Erysipelotrichia;o__Erysipelotrichales;f__Erysipelotrichaceae;g__Turicibacter;s__;d__denovo7357
## 2 k__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__Enterobacteriales;f__Enterobacteriaceae;g__Escherichia/Shigella;s__;d__denovo5395
## 3                        k__Bacteria;p__Firmicutes;c__Clostridia;o__Clostridiales;f__Ruminococcaceae;g__Faecalibacterium;s__;d__denovo22216
## 4                          k__Bacteria;p__Bacteroidetes;c__Bacteroidia;o__Bacteroidales;f__Bacteroidaceae;g__Bacteroides;s__;d__denovo11290
## 5                                         k__Bacteria;p__Firmicutes;c__Clostridia;o__Clostridiales;f__Lachnospiraceae;g__;s__;d__denovo5133
## 6                           k__Bacteria;p__Bacteroidetes;c__Bacteroidia;o__Bacteroidales;f__Prevotellaceae;g__Prevotella;s__;d__denovo15141
##        pval test_statistic           SE effect_size    mean_abun
## 1 0.4989316     0.67617169 1.071357e-06  0.54599655 5.763634e-06
## 2 0.6519241     0.45109074 1.077522e-06  0.72335916 5.804411e-06
## 3 0.2759928    -1.08936527 4.151301e-07 -0.43873960 1.271709e-06
## 4 0.8367976     0.20599138 1.166077e-06  1.38015692 2.128078e-06
## 5 0.6461966     0.45905233 1.839339e-06  0.28805353 9.347246e-06
## 6 0.9834296    -0.02076936 6.037991e-07 -0.06963892 1.911190e-06
##   mean_abun_present   ubiquity
## 1      6.186300e-05 0.09316770
## 2      6.922298e-05 0.08385093
## 3      3.899908e-05 0.03260870
## 4      1.245893e-04 0.01708075
## 5      7.431637e-05 0.12577640
## 6      6.154032e-05 0.03105590

Finally, let’s try to add phylogeny as covariates. Here we’ll have columns for each separate taxonomic level.

res <- res %>% separate(otu, 
                        c("kingdom", "phylum", "class", "order", "family", "genus", "species", "denovo"), 
                        sep=";", remove = FALSE)

Check Covariate Diagnostics

Here we look to see if the covariates do indeed look informative.

Ubiquity
strat_hist(res, pvalue="pval", covariate="ubiquity", maxy=5, binwidth=0.05, numQ=4)

rank_scatter(res, pvalue="pval", covariate="ubiquity")

Mean abundance (across non-zero samples)
strat_hist(res, pvalue="pval", covariate="mean_abun_present", maxy=5, binwidth=0.05, numQ=4)

rank_scatter(res, pvalue="pval", covariate="mean_abun_present")

Phylogeny

Let’s look at phylum-level stratification first. A priori, I might expect Proteobacteria to be enriched for low p-values, but that expectation isn’t well grounded, and Eric doesn’t think phylogeny will be informative at all.

#strat_hist(res, pvalue="pval", covariate="mean_abun_present", maxy=20)
#rank_scatter(res, pvalue="pval", covariate="ubiquity")
ggplot(res, aes(x=pval)) + geom_histogram() + facet_wrap(~phylum, scales = "free")

Let’s use ubiquity as our ind_covariate

res <- res %>% mutate(ind_covariate = ubiquity)

Also (hacky), set test_statistic and effect_size to NA so that ash doesn’t run.

res <- res %>% 
  mutate(test_statistic = NA) %>%
  mutate(effect_size = NA)

Set up BenchDesign object

First, we’ll create an object of the BenchDesign class to hold the data and add the benchmark methods to the BenchDesign object.

Next, we’ll construct the SummarizedBenchmark object, which will run the functions specified in each method (these are actually sourced in from the helper scripts).

for (nmids in c(5, 50, 150)){
  print(nmids)
  bd <- initializeBenchDesign(nmids=nmids)
  
  resfile <- paste0(resdir, "goodrich_summarizedBenchmark_",
                    nrow(res), "_nmids_", nmids, ".RData")
  duration <- NA
  if (!file.exists(resfile)){
    t1 <- proc.time()
    sb <- bd %>% buildBench(data=res, ftCols = "ind_covariate")
                            #parallel=TRUE, BPPARAM=multicoreParam)
    metadata(sb)$data_download_link <- "https://zenodo.org/record/840333/files/ob_goodrich_results.tar.gz"
    save(sb, file=resfile)
    duration <- round((proc.time()-t1)[3]/60,1)
  }else{
    load(resfile)
  }
}

The outputs of this call are:

[1] 5
Squarem-1
Objective fn: 715262
Objective fn: 104929  Extrapolation: 0  Steplength: 1
Objective fn: 95918.9  Extrapolation: 1  Steplength: 2.44204
Objective fn: 95614.3  Extrapolation: 1  Steplength: 2.72536
Objective fn: 95583.3  Extrapolation: 1  Steplength: 2.8568
Objective fn: 95578.1  Extrapolation: 1  Steplength: 3.54373
Objective fn: 95576.3  Extrapolation: 1  Steplength: 3.52004
Objective fn: 95575.4  Extrapolation: 1  Steplength: 4
.
.
.
NaNs produced<simpleError in if (sum(D^2) > qchisq(0.9, nmids - 2 - df)) {    warning(paste0("f(z) misfit = ", round(D, 1), ".  Rerun with increased df."))}: missing value where TRUE/FALSE needed>

[1] 50
Squarem-1
Objective fn: 715262
Objective fn: 104929  Extrapolation: 0  Steplength: 1
Objective fn: 95918.9  Extrapolation: 1  Steplength: 2.44204
Objective fn: 95614.3  Extrapolation: 1  Steplength: 2.72536
Objective fn: 95583.3  Extrapolation: 1  Steplength: 2.8568
Objective fn: 95578.1  Extrapolation: 1  Steplength: 3.54373
Objective fn: 95576.3  Extrapolation: 1  Steplength: 3.52004
Objective fn: 95575.4  Extrapolation: 1  Steplength: 4
Objective fn: 95574.9  Extrapolation: 1  Steplength: 3.36769
Objective fn: 95574.5  Extrapolation: 1  Steplength: 5.09115
Objective fn: 95574.3  Extrapolation: 1  Steplength: 2.52066
Objective fn: 95574  Extrapolation: 1  Steplength: 13.3966
.
.
.
Objective fn: 95572.9  Extrapolation: 1  Steplength: 1.61619
f(z) misfit = -8.7.  Rerun with increased df.f(z) misfit = 33.3.  Rerun with increased df.f(z) misfit = -9.4.  Rerun with increased df.f(z) misfit = -37.3.  Rerun with increased df.f(z) misfit = 49.2.  Rerun with increased df.f(z) misfit = -36.9.  Rerun with increased df.f(z) misfit = 10.6.  Rerun with increased df.f(z) misfit = 13.3.  Rerun with increased df.f(z) misfit = -18.3.  Rerun with increased df.f(z) misfit = 20.1.  Rerun with increased df.f(z) misfit = -21.6.  Rerun with increased df.f(z) misfit = 23.4.  Rerun with increased df.f(z) misfit = -27.  Rerun with increased df.f(z) misfit = 17.2.  Rerun with increased df.f(z) misfit = -8.2.  Rerun with increased df.f(z) misfit = 4.3.  Rerun with increased df.f(z) misfit = 3.4.  Rerun with increased df.f(z) misfit = 2.4.  Rerun with increased df.f(z) misfit = -2.3.  Rerun with increased df.f(z) misfit = -10.7.  Rerun with increased df.f(z) misfit = -2.4.  Rerun with increased df.f(z) misfit = 15.4.  Rerun with increased df.f(z) misf... <truncated>

[1] 150
Squarem-1
Objective fn: 715262
Objective fn: 104929  Extrapolation: 0  Steplength: 1
Objective fn: 95918.9  Extrapolation: 1  Steplength: 2.44204
Objective fn: 95614.3  Extrapolation: 1  Steplength: 2.72536
Objective fn: 95583.3  Extrapolation: 1  Steplength: 2.8568
Objective fn: 95578.1  Extrapolation: 1  Steplength: 3.54373
Objective fn: 95576.3  Extrapolation: 1  Steplength: 3.52004
Objective fn: 95575.4  Extrapolation: 1  Steplength: 4
Objective fn: 95574.9  Extrapolation: 1  Steplength: 3.36769
Objective fn: 95574.5  Extrapolation: 1  Steplength: 5.09115
Objective fn: 95574.3  Extrapolation: 1  Steplength: 2.52066
Objective fn: 95574  Extrapolation: 1  Steplength: 13.3966
.
.
.
f(z) misfit = -20.2.  Rerun with increased df.f(z) misfit = 13.6.  Rerun with increased df.f(z) misfit = -3.8.  Rerun with increased df.f(z) misfit = 9.  Rerun with increased df.f(z) misfit = -21.2.  Rerun with increased df.f(z) misfit = 28.8.  Rerun with increased df.f(z) misfit = 41.9.  Rerun with increased df.f(z) misfit = -13.2.  Rerun with increased df.f(z) misfit = -12.8.  Rerun with increased df.f(z) misfit = 15.2.  Rerun with increased df.f(z) misfit = -18.1.  Rerun with increased df.f(z) misfit = -19.1.  Rerun with increased df.f(z) misfit = -18.6.  Rerun with increased df.f(z) misfit = -26.  Rerun with increased df.f(z) misfit = -7.2.  Rerun with increased df.f(z) misfit = 67.1.  Rerun with increased df.f(z) misfit = 27.2.  Rerun with increased df.f(z) misfit = -16.4.  Rerun with increased df.f(z) misfit = -21.1.  Rerun with increased df.f(z) misfit = -25.9.  Rerun with increased df.f(z) misfit = -7.5.  Rerun with increased df.f(z) misfit = 10.1.  Rerun with increased df.f(z)... <truncated>

Next, we’ll add the default performance metric for q-value assays. First, we have to rename the assay to ‘qvalue’.

# rename assay to qvalue
assayNames(sb) <- "qvalue"
sb <- addDefaultMetrics(sb)

Now, we’ll plot the results.

Debugging note: to look at the matrix of qvalues, run assays(sb)[["qvalue"]]

# plot nrejects by method overall and stratified by covariate
rejections_scatter( sb, as_fraction=FALSE, supplementary=FALSE)
## Warning: Removed 30 rows containing missing values (geom_point).

rejection_scatter_bins(sb, covariate="ind_covariate", supplementary=FALSE)
## Warning: Removed 120 rows containing missing values (geom_point).

#plotFDRMethodsOverlap(sb, alpha=0.1, supplementary=FALSE )

There is no UpSetR plot for OB Goodrich for now because only lfdr is returning any rejections.

assays(sb)[["qvalue"]][, "ashs"] %>% max()
sum(is.na(assays(sb)[["qvalue"]][, "scott-empirical"]))
sum(is.na(assays(sb)[["qvalue"]][, "lfdr"]))

Goodrich, genus-level analysis

## Collapse goodrich OTU table to genus level

# Add column with otu names
genus_df <- as.data.frame(abun_otu)
genus_df$otu <- rownames(genus_df)
# Melt into tidy format
genus_df <- genus_df %>% melt(id.var="otu", variable.name="sample", value.name="abun")
# Split into separate taxonomic levels
genus_df <- genus_df %>% 
  separate(otu, c("kingdom", "phylum", "class", "order", "family", "genus", "species", "denovo"), 
           sep=";", remove = FALSE)

# Get rid of unannotated genera, and sum abundances per genus
genus_df <- genus_df %>% 
  filter(genus != "g__") %>% 
  group_by(genus, sample) %>% 
  summarise(total_abun = sum(abun)) 

# Convert back to wide format (genera in rows, samples in columns)
genus_df <- genus_df %>% spread(genus, total_abun) %>% as.data.frame
rownames(genus_df) <- genus_df$sample
genus_df <- genus_df %>% select(-sample) %>% t

# And re-order columns to match metadata
genus_df <- genus_df[, match(meta$X, colnames(genus_df))]
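# Sanity check (hypothetical addition): every metadata sample should be
# present in the collapsed table, otherwise match() yields NA columns
stopifnot(!anyNA(match(meta$X, colnames(genus_df))))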
resfile <- paste0(resdir, "goodrich_results_",
                  nrow(genus_df), "_genera.RData")
if (!file.exists(resfile)){
  # Get case and control indices
  case_idx <- meta$DiseaseState %in% c("OB")
  ctrl_idx <- meta$DiseaseState %in% c("H")
  
  # Calculate pvalues, effects, and stderr
  pvals <- c()
  teststats <- c()
  ses <- c()
  effs <- c()
  mean_abuns <- c()
  mean_abuns_present <- c()
  ubis <- c()
  
  casedf <- genus_df[, case_idx]
  ctrldf <- genus_df[, ctrl_idx]
  for (o in rownames(genus_df)) {
    # wilcoxon p value  
    w <- wilcox.test(casedf[o, ], 
                     ctrldf[o, ])
    p <- w$p.value
    teststat <- w$statistic
    
    pvals <- c(pvals, p)
    teststats <- c(teststats, teststat)
    
    # standard error of the OTU abundance, across all samples
    ses <- c(ses, 
             sd(genus_df[o, ])/sqrt(length(genus_df[o, ]))
             )
    
    # mean OTU abundance across all samples (after removing pseudo-count)
    mean_abuns <- c(mean_abuns, 
                    mean(genus_df[o, ] - zeroabun)
                    )
    
    # mean OTU abundance across only samples with the OTU present
    mean_abuns_present <- c(mean_abuns_present, 
                            sum(genus_df[o, ] - zeroabun) / sum(genus_df[o, ] > zeroabun)
                            )
    
    # ubiquity of OTU across all samples
    ubis <- c(ubis, 
              sum(genus_df[o, ] > zeroabun) / length(genus_df[o, ])
              )
    
    # effect (logfold difference)
    effs <- c(effs, 
              log(mean(genus_df[o, ctrl_idx])/mean(genus_df[o, case_idx]))
              )
  
  }
  
  res <- data.frame(otu = rownames(genus_df), 
                    pval = pvals, test_statistic = teststats,
                    SE = ses, effect_size = effs, 
                    mean_abun = mean_abuns, mean_abun_present = mean_abuns_present,
                    ubiquity = ubis)
  res <- res %>%
    mutate(test_statistic = qnorm(exp(log(pval)-log(2)), lower.tail=FALSE) * sign(effect_size))
  
  save(res, file=resfile)
  
}else{
  load(resfile)
}

Check Covariate Diagnostics

Here we look to see if the covariates do indeed look informative.

Ubiquity
strat_hist(res, pvalue="pval", covariate="ubiquity", maxy=10, binwidth=0.05)

rank_scatter(res, pvalue="pval", covariate="ubiquity")

Something weird is happening around p = 0.20 and p = 0.40; this may have something to do with ties in the Wilcoxon test (a quick check is sketched below). Either way, ubiquity looks somewhat informative (from the scatter plot, though not really from the histograms).
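
A quick, illustrative way to check the ties explanation: with so many zero abundances, the rank test produces identical statistics for many genera, so a few exact p-values recur.

# Tally the most frequent exact p-values; large counts point to tied
# test statistics rather than a genuine feature of the data
head(sort(table(res$pval), decreasing = TRUE))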

Mean abundance (across non-zero samples)
strat_hist(res, pvalue="pval", covariate="mean_abun_present", maxy=8, binwidth=0.05)

rank_scatter(res, pvalue="pval", covariate="mean_abun_present")

Let’s use ubiquity as our ind_covariate

res <- res %>% mutate(ind_covariate = ubiquity)

Also (hacky), set test_statistic and effect_size to NA so that ash doesn’t run.

res <- res %>% 
  mutate(test_statistic = NA) %>%
  mutate(effect_size = NA)

Set up BenchDesign object

First, we’ll create an object of the BenchDesign class to hold the data and add the benchmark methods to the BenchDesign object. We’ll do this for a few values of nmids to explore its effect on the outputs.

Then, we’ll construct the SummarizedBenchmark object, which will run the functions specified in each method (these are actually sourced in from the helper scripts).

for (nmids in c(5, 50, 150)){
  print(nmids)
  bd <- initializeBenchDesign(nmids=nmids)
  
  resfile <- paste0(resdir, "goodrich_summarizedBenchmark_",
                    nrow(res), "_genera_nmids_", nmids, ".RData")
  duration <- NA
  if (!file.exists(resfile)){
    t1 <- proc.time()
    sb <- bd %>% buildBench(data=res, ftCols = "ind_covariate")
                            #parallel=TRUE, BPPARAM=multicoreParam)
    metadata(sb)$data_download_link <- "https://zenodo.org/record/840333/files/ob_goodrich_results.tar.gz"
    save(sb, file=resfile)
    duration <- round((proc.time()-t1)[3]/60,1)
  }else{
    load(resfile)
  }
}

Next, we’ll add the default performance metric for q-value assays. First, we have to rename the assay to ‘qvalue’.

# rename assay to qvalue
assayNames(sb) <- "qvalue"
sb <- addDefaultMetrics(sb)

Now, we’ll plot the results.

Debugging note: to look at the matrix of qvalues, run assays(sb)[["qvalue"]]

# plot nrejects by method overall and stratified by covariate
rejections_scatter( sb, as_fraction=FALSE, supplementary=FALSE)
## Warning: Removed 30 rows containing missing values (geom_point).

rejection_scatter_bins(sb, covariate="ind_covariate", supplementary=FALSE)
## Warning: Removed 120 rows containing missing values (geom_point).

#plotFDRMethodsOverlap(sb, alpha=0.1, supplementary=FALSE, order.by="freq", nsets=100 )

Note: Benjamini-Hochberg (bh) overlaps exactly with the IHW results.

plotFDRMethodsOverlap(sb, alpha=0.1, supplementary=FALSE, order.by="freq", nsets=100 )

methods <- c( "lfdr", "ihw-a10", "bl-df03", "qvalue", "bh", "bonf" )
plotCovariateBoxplots( sb, alpha=0.1, nsets=6, methods=methods)

CRC Baxter - colorectal cancer

Next, I’ll look at the CRC Baxter dataset. For colorectal cancer, we expect some truly differentially abundant OTUs, but not as many as in diarrhea, so this dataset will hopefully provide an intermediate, non-extreme case study. We’ll download the processed OTU tables from Zenodo and unzip them in the benchmark-fdr/datasets/microbiome/DATA folder.

Data download

cd /Users/claire/github/benchmark-fdr/datasets/microbiome/DATA/
curl -O "https://zenodo.org/record/840333/files/crc_baxter_results.tar.gz"
tar -xzvf crc_baxter_results.tar.gz
cd ..

Next, we’ll read in the unzipped OTU table and metadata files into R.

We’ll need to manually define which disease labels are of interest and which metadata column contains the sample IDs (i.e. the IDs used in the OTU table).

otu_path <- paste0(datdir, "crc_baxter_results/RDP/crc_baxter.otu_table.100.denovo.rdp_assigned")
meta_path <- paste0(datdir, "crc_baxter_results/crc_baxter.metadata.txt")

# DiseaseState labels to keep
labels <- c("H", "CRC")

# Metadata column that has sample IDs
col_label <- "Sample_Name_s"

# load OTU table and metadata
otu <- read.table(otu_path)
meta <- read.csv(meta_path, sep='\t')

# Keep only samples with the right DiseaseState metadata
meta <- meta %>% filter(DiseaseState %in% labels)

# Add "X" in front of sample IDs because of how R read in the OTU table
meta[, 1] <- sub("^", "X", meta[, 1])

# Keep only samples with both metadata and 16S data
keep_samples <- intersect(colnames(otu), meta[[col_label]])
otu <- otu[, keep_samples]
meta <- meta %>% filter(get(col_label) %in% keep_samples)

A brief aside: what’s the distribution of total read counts per sample?

otu %>% colSums %>% hist()

Since we’ll be using OTU-wise covariates, we shouldn’t need to perform any filtering/cleaning of the OTUs beyond removing any that are all zeros (which can happen after removing shallow samples). However, we do remove any samples with too few reads; the minimum number of reads per sample is defined in sample_reads.

After we’ve removed any shallow samples, we’ll convert the OTU table to relative abundances. A pseudo-count could be added to zero entries to avoid problems with taking logs, but zeroabun is set to 0 below, so no pseudo-count is actually added.

remove_shallow_smpls <- function(df, n_reads) {
  # Removes samples (columns) with fewer than n_reads total reads.
  # df has OTUs in rows and samples in columns.
  return(df[, colSums(df) >= n_reads])
}

remove_shallow_otus <- function(df, n_reads){
  # Removes OTUs (rows) with fewer than n_reads total reads across samples.
  return(df[rowSums(df) > n_reads, ])
}

remove_rare_otus <- function(df, perc_samples){
  # Removes OTUs (rows) present in at most perc_samples (fraction) of samples.
  return(df[rowSums(df > 0) / dim(df)[2] > perc_samples, ])
}

## Remove OTUs with fewer than 10 reads
otu <- remove_shallow_otus(otu, 10)
print(dim(otu))
## [1] 10381   292
## Remove OTUs in fewer than 1% of samples
otu <- remove_rare_otus(otu, 0.01)
print(dim(otu))
## [1] 9388  292
# Remove samples with fewer than sample_reads reads
sample_reads <- 100
otu <- remove_shallow_smpls(otu, sample_reads)
print(dim(otu))
## [1] 9388  292
# Update metadata with new samples
meta <- meta %>% filter(get(col_label) %in% colnames(otu))

# Remove empty OTUs
otu <- otu[rowSums(otu) > 0, ]
print(dim(otu))
## [1] 9388  292
# Convert to relative abundance
abun_otu <- t(t(otu) / rowSums(t(otu)))

# Add pseudo counts (currently disabled: zeroabun is 0)
#minabun <- min(abun_otu[abun_otu > 0]) # Use this to figure out what pseudo-count value to add
zeroabun <- 0
abun_otu <- abun_otu + zeroabun

Next, we need to calculate the p-value, effect size, and standard error for each OTU. Here, we’ll compare CRC vs. healthy; we won’t consider the nonCRC adenoma patients. We’ll put these results into a dataframe and label the columns with the standardized names for downstream use (pval, SE, effect_size, test_statistic). The test statistic is the one returned by wilcox.test().

Note that the effect here is calculated as the log-fold change of mean abundance in controls relative to cases (i.e. log(mean_abun[controls]/mean_abun[cases])).

While we’re at it, we’ll also calculate the mean abundance and ubiquity (detection rate) of each OTU. Later, we can assign their values to a new column called ind_covariate for use in downstream steps.

resfile <- paste0(resdir, "baxter_results_",
                  nrow(abun_otu), "_OTUs.RData")
if (!file.exists(resfile)){
  # Get case and control indices
  case_idx <- meta$DiseaseState %in% c("CRC")
  ctrl_idx <- meta$DiseaseState %in% c("H")
  
  # Calculate pvalues, effects, and stderr
  pvals <- c()
  teststats <- c()
  ses <- c()
  effs <- c()
  mean_abuns <- c()
  mean_abuns_present <- c()
  ubis <- c()
  
  casedf <- abun_otu[, case_idx]
  ctrldf <- abun_otu[, ctrl_idx]
  for (o in rownames(abun_otu)) {
    # wilcoxon p value  
    w <- wilcox.test(casedf[o, ], 
                     ctrldf[o, ])
    p <- w$p.value
    teststat <- w$statistic
    
    pvals <- c(pvals, p)
    teststats <- c(teststats, teststat)
    
    # standard error of the OTU abundance, across all samples
    ses <- c(ses, 
             sd(abun_otu[o, ])/sqrt(length(abun_otu[o, ]))
             )
    
    # mean OTU abundance across all samples (after removing pseudo-count)
    mean_abuns <- c(mean_abuns, 
                    mean(abun_otu[o, ] - zeroabun)
                    )
    
    # mean OTU abundance across only samples with the OTU present
    mean_abuns_present <- c(mean_abuns_present, 
                            sum(abun_otu[o, ] - zeroabun) / sum(abun_otu[o, ] > zeroabun)
                            )
    
    # ubiquity of OTU across all samples
    ubis <- c(ubis, 
              sum(abun_otu[o, ] > zeroabun) / length(abun_otu[o, ])
              )
    
    # effect (logfold difference)
    effs <- c(effs, 
              log(mean(abun_otu[o, ctrl_idx])/mean(abun_otu[o, case_idx]))
              )
  
  }
  
  res <- data.frame(otu = rownames(abun_otu), 
                    pval = pvals, test_statistic = teststats,
                    SE = ses, effect_size = effs, 
                    mean_abun = mean_abuns, mean_abun_present = mean_abuns_present,
                    ubiquity = ubis)
  res <- res %>%
    mutate(test_statistic = qnorm(exp(log(pval)-log(2)), lower.tail=FALSE) * sign(effect_size))
  
  save(res, file=resfile)
  
}else{
  load(resfile)
}
head(res)
##                                                                                                                otu
## 1                 k__Bacteria;p__Firmicutes;c__Clostridia;o__Clostridiales;f__Ruminococcaceae;g__;s__;d__denovo723
## 2 k__Bacteria;p__Firmicutes;c__Clostridia;o__Clostridiales;f__Ruminococcaceae;g__Faecalibacterium;s__;d__denovo722
## 3   k__Bacteria;p__Firmicutes;c__Clostridia;o__Clostridiales;f__Lachnospiraceae;g__Anaerostipes;s__;d__denovo12896
## 4   k__Bacteria;p__Firmicutes;c__Clostridia;o__Clostridiales;f__Lachnospiraceae;g__Ruminococcus2;s__;d__denovo3851
## 5  k__Bacteria;p__Bacteroidetes;c__Bacteroidia;o__Bacteroidales;f__Bacteroidaceae;g__Bacteroides;s__;d__denovo3850
## 6                 k__Bacteria;p__Proteobacteria;c__Betaproteobacteria;o__Burkholderiales;f__;g__;s__;d__denovo3853
##        pval test_statistic           SE effect_size    mean_abun
## 1 0.6923277    -0.39569812 8.446136e-05 -1.05786804 2.156097e-04
## 2 0.4653056     0.73013832 1.806014e-05  0.07466401 1.148666e-04
## 3 0.1637003     1.39273386 8.197627e-07  1.64499330 2.049111e-06
## 4 0.9575059     0.05328364 1.089610e-05  0.32376756 1.634014e-05
## 5 0.1200770    -1.55445061 4.414220e-06 -1.35443862 1.059000e-05
## 6 0.3852794    -0.86820991 2.380873e-06 -1.98252197 4.049247e-06
##   mean_abun_present   ubiquity
## 1      3.147902e-03 0.06849315
## 2      3.568196e-04 0.32191781
## 3      5.983405e-05 0.03424658
## 4      9.542639e-04 0.01712329
## 5      1.236912e-04 0.08561644
## 6      2.364760e-04 0.01712329

Finally, let’s try to add phylogeny as covariates. Here we’ll have columns for each separate taxonomic level.

res <- res %>% separate(otu, 
                        c("kingdom", "phylum", "class", "order", "family", "genus", "species", "denovo"), 
                        sep=";", remove = FALSE)

Check Covariate Diagnostics

Here we look to see if the covariates do indeed look informative.

Ubiquity
strat_hist(res, pvalue="pval", covariate="ubiquity", maxy=10, binwidth=0.05)

rank_scatter(res, pvalue="pval", covariate="ubiquity")

Something weird is happening around p = 0.20 and p = 0.40; as before, this may be due to ties in the Wilcoxon test. Either way, ubiquity looks somewhat informative (from the scatter plot, though not really from the histograms).

Mean abundance (across non-zero samples)
strat_hist(res, pvalue="pval", covariate="mean_abun_present", maxy=8, binwidth=0.05)

rank_scatter(res, pvalue="pval", covariate="mean_abun_present")

Phylogeny

Let’s look at phylum-level stratification first. A priori, I might expect Proteobacteria to be enriched for low p-values, but that expectation isn’t well grounded, and Eric doesn’t think phylogeny will be informative at all.

#strat_hist(res, pvalue="pval", covariate="mean_abun_present", maxy=20)
#rank_scatter(res, pvalue="pval", covariate="ubiquity")
ggplot(res, aes(x=pval)) + geom_histogram() + facet_wrap(~phylum, scales = "free")

Not really informative either…

Let’s use ubiquity as our ind_covariate

res <- res %>% mutate(ind_covariate = ubiquity)

Also (hacky), set test_statistic and effect_size to NA so that ash doesn’t run.

res <- res %>% 
  mutate(test_statistic = NA) %>%
  mutate(effect_size = NA)

Set up BenchDesign object

First, we’ll create an object of the BenchDesign class to hold the data and add the benchmark methods to the BenchDesign object. We’ll do this for a few values of nmids to explore its effect on the outputs.

Then, we’ll construct the SummarizedBenchmark object, which will run the functions specified in each method (these are actually sourced in from the helper scripts).

for (nmids in c(5, 50, 150)){
  print(nmids)
  bd <- initializeBenchDesign(nmids=nmids)
  
  resfile <- paste0(resdir, "baxter_summarizedBenchmark_",
                    nrow(res), "_nmids_", nmids, ".RData")
  duration <- NA
  if (!file.exists(resfile)){
    t1 <- proc.time()
    sb <- bd %>% buildBench(data=res, ftCols = "ind_covariate")
                            #parallel=TRUE, BPPARAM=multicoreParam)
    metadata(sb)$data_download_link <- "https://zenodo.org/record/840333/files/crc_baxter_results.tar.gz"
    save(sb, file=resfile)
    duration <- round((proc.time()-t1)[3]/60,1)
  }else{
    load(resfile)
  }
}

Just for future runs, where I load the file instead of re-calculating it, here are the errors I got:

[1] 5
Squarem-1
Objective fn: 944053
Objective fn: 111059  Extrapolation: 0  Steplength: 1
Objective fn: 92567.2  Extrapolation: 1  Steplength: 2.85015
Objective fn: 91136.8  Extrapolation: 1  Steplength: 2.70993
Objective fn: 90930.1  Extrapolation: 1  Steplength: 4
.
.
.
Censored sample for null model estimation has only size 0 !Censored sample for null model estimation has only size 0 !Censored sample for null model estimation has only size 0 !Censored sample for null model estimation has only size 0 !Censored sample for null model estimation has only size 0 !Censored sample for null model estimation has only size 0 !Censored sample for null model estimation has only size 0 !Censored sample for null model estimation has only size 0 !Censored sample for null model estimation has only size 0 !NaNs produced<simpleError in if (sum(D^2) > qchisq(0.9, nmids - 2 - df)) {    warning(paste0("f(z) misfit = ", round(D, 1), ".  Rerun with increased df."))}: missing value where TRUE/FALSE needed>

[1] 50
Squarem-1
Objective fn: 944053
Objective fn: 111059  Extrapolation: 0  Steplength: 1
Objective fn: 92567.2  Extrapolation: 1  Steplength: 2.85015
Objective fn: 91136.8  Extrapolation: 1  Steplength: 2.70993
Objective fn: 90930.1  Extrapolation: 1  Steplength: 4
.
.
.
Censored sample for null model estimation has only size 0 !Censored sample for null model estimation has only size 0 !Censored sample for null model estimation has only size 0 !Censored sample for null model estimation has only size 0 !Censored sample for null model estimation has only size 0 !Censored sample for null model estimation has only size 0 !Censored sample for null model estimation has only size 0 !Censored sample for null model estimation has only size 0 !Censored sample for null model estimation has only size 0 !f(z) misfit = -33.4.  Rerun with increased df.f(z) misfit = -40.5.  Rerun with increased df.f(z) misfit = 207.5.  Rerun with increased df.f(z) misfit = -33.  Rerun with increased df.f(z) misfit = -13.  Rerun with increased df.f(z) misfit = -35.  Rerun with increased df.f(z) misfit = -44.7.  Rerun with increased df.f(z) misfit = -38.4.  Rerun with increased df.f(z) misfit = -35.2.  Rerun with increased df.f(z) misfit = -59.5.  Rerun with increased df.f(z) misfit = 275.  Rerun with increased df.f(z) misfit = -30.1.  Rerun with increased df.f(z) misfit = -74.7.  Rerun with increased df.f(z) misfit = -87.9.  Rerun with increased df.f(z) misfit = -85.1.  Rerun with increased df.f(z) misfit = 268.2.  Rerun with increased df.f(z) misfit = -56.9.  Rerun with increased df.f(z) misfit = -39.4.  Rerun with increased df.f(z) misfit = -8.8.  Rerun with increased df.f(z) misfit = -35.6.  Rerun with increased df.f(z) misfit = -30.2.  Rerun with increased df.f(z) misfit = -14.3.  Rerun with increased ... <truncated>

[1] 150
Squarem-1
Objective fn: 944053
Objective fn: 111059  Extrapolation: 0  Steplength: 1
Objective fn: 92567.2  Extrapolation: 1  Steplength: 2.85015
Objective fn: 91136.8  Extrapolation: 1  Steplength: 2.70993
Objective fn: 90930.1  Extrapolation: 1  Steplength: 4
Objective fn: 90882.9  Extrapolation: 0  Steplength: 1
.
.
.
Censored sample for null model estimation has only size 0 !Censored sample for null model estimation has only size 0 !Censored sample for null model estimation has only size 0 !Censored sample for null model estimation has only size 0 !Censored sample for null model estimation has only size 0 !Censored sample for null model estimation has only size 0 !Censored sample for null model estimation has only size 0 !Censored sample for null model estimation has only size 0 !Censored sample for null model estimation has only size 0 !f(z) misfit = -15.9.  Rerun with increased df.f(z) misfit = 8.3.  Rerun with increased df.f(z) misfit = -10.8.  Rerun with increased df.f(z) misfit = -18.1.  Rerun with increased df.f(z) misfit = -24.6.  Rerun with increased df.f(z) misfit = -27.1.  Rerun with increased df.f(z) misfit = -24.6.  Rerun with increased df.f(z) misfit = -18.8.  Rerun with increased df.f(z) misfit = -21.4.  Rerun with increased df.f(z) misfit = 279.5.  Rerun with increased df.f(z) misfit = 83.1.  Rerun with increased df.f(z) misfit = -27.4.  Rerun with increased df.f(z) misfit = -28.3.  Rerun with increased df.f(z) misfit = -7.4.  Rerun with increased df.f(z) misfit = 11.6.  Rerun with increased df.f(z) misfit = -18.2.  Rerun with increased df.f(z) misfit = -23.  Rerun with increased df.f(z) misfit = -16.7.  Rerun with increased df.f(z) misfit = -24.1.  Rerun with increased df.f(z) misfit = -23.2.  Rerun with increased df.f(z) misfit = -24.4.  Rerun with increased df.f(z) misfit = -26.6.  Rerun with increase... <truncated>> 

Next, we’ll add the default performance metric for q-value assays. First, we have to rename the assay to ‘qvalue’.

# rename assay to qvalue
assayNames(sb) <- "qvalue"
sb <- addDefaultMetrics(sb)

Now, we’ll plot the results.

Debugging note: to look at the matrix of qvalues, run assays(sb)[["qvalue"]]

# plot nrejects by method overall and stratified by covariate
rejections_scatter( sb, as_fraction=FALSE, supplementary=FALSE)
## Warning: Removed 30 rows containing missing values (geom_point).

rejection_scatter_bins(sb, covariate="ind_covariate", supplementary=FALSE)
## Warning: Removed 120 rows containing missing values (geom_point).

#plotFDRMethodsOverlap(sb, alpha=0.1, supplementary=FALSE, order.by="freq", nsets=100 )
assays(sb)[["qvalue"]][, "ashs"] %>% max()
sum(is.na(assays(sb)[["qvalue"]][, "scott-empirical"]))
sum(is.na(assays(sb)[["qvalue"]][, "lfdr"]))

Baxter, genus-level analysis

## Collapse baxter OTU table to genus level

# Add column with otu names
genus_df <- as.data.frame(abun_otu)
genus_df$otu <- rownames(genus_df)
# Melt into tidy format
genus_df <- genus_df %>% melt(id.var="otu", variable.name="sample", value.name="abun")
# Split into separate taxonomic levels
genus_df <- genus_df %>% 
  separate(otu, c("kingdom", "phylum", "class", "order", "family", "genus", "species", "denovo"), 
           sep=";", remove = FALSE)

# Get rid of unannotated genera, and sum abundances per genus
genus_df <- genus_df %>% 
  filter(genus != "g__") %>% 
  group_by(genus, sample) %>% 
  summarise(total_abun = sum(abun)) 

# Convert back to wide format (genera in rows, samples in columns)
genus_df <- genus_df %>% spread(genus, total_abun) %>% as.data.frame
rownames(genus_df) <- genus_df$sample
genus_df <- genus_df %>% select(-sample) %>% t

# And re-order columns to match metadata
genus_df <- genus_df[, match(meta$Sample_Name_s, colnames(genus_df))]
resfile <- paste0(resdir, "baxter_results_",
                  nrow(genus_df), "_genera.RData")
if (!file.exists(resfile)){
  # Get case and control indices
  case_idx <- meta$DiseaseState %in% c("CRC")
  ctrl_idx <- meta$DiseaseState %in% c("H")
  
  # Calculate pvalues, effects, and stderr
  pvals <- c()
  teststats <- c()
  ses <- c()
  effs <- c()
  mean_abuns <- c()
  mean_abuns_present <- c()
  ubis <- c()
  
  casedf <- genus_df[, case_idx]
  ctrldf <- genus_df[, ctrl_idx]
  for (o in rownames(genus_df)) {
    # wilcoxon p value  
    w <- wilcox.test(casedf[o, ], 
                     ctrldf[o, ])
    #w <- kruskal.test(list(casedf[o, ], ctrldf[o, ]))
    p <- w$p.value
    teststat <- w$statistic
    
    pvals <- c(pvals, p)
    teststats <- c(teststats, teststat)
    
    # standard error of the OTU abundance, across all samples
    ses <- c(ses, 
             sd(genus_df[o, ])/sqrt(length(genus_df[o, ]))
             )
    
    # mean OTU abundance across all samples (after removing pseudo-count)
    mean_abuns <- c(mean_abuns, 
                    mean(genus_df[o, ] - zeroabun)
                    )
    
    # mean OTU abundance across only samples with the OTU present
    mean_abuns_present <- c(mean_abuns_present, 
                            sum(genus_df[o, ] - zeroabun) / sum(genus_df[o, ] > zeroabun)
                            )
    
    # ubiquity of OTU across all samples
    ubis <- c(ubis, 
              sum(genus_df[o, ] > zeroabun) / length(genus_df[o, ])
              )
    
    # effect (logfold difference)
    effs <- c(effs, 
              log(mean(genus_df[o, ctrl_idx])/mean(genus_df[o, case_idx]))
              )
  
  }
  
  res <- data.frame(otu = rownames(genus_df), 
                    pval = pvals, test_statistic = teststats,
                    SE = ses, effect_size = effs, 
                    mean_abun = mean_abuns, mean_abun_present = mean_abuns_present,
                    ubiquity = ubis)
  res <- res %>%
    mutate(test_statistic = qnorm(exp(log(pval)-log(2)), lower.tail=FALSE) * sign(effect_size))
  
  save(res, file=resfile)
  
}else{
  load(resfile)
}

Check Covariate Diagnostics

Here we look to see if the covariates do indeed look informative.

Ubiquity
strat_hist(res, pvalue="pval", covariate="ubiquity", maxy=10, binwidth=0.05)

rank_scatter(res, pvalue="pval", covariate="ubiquity")

Something weird is happening around p = 0.20 and p = 0.40; as before, this may be due to ties in the Wilcoxon test. Either way, ubiquity looks somewhat informative (from the scatter plot, though not really from the histograms).

Mean abundance (across non-zero samples)
strat_hist(res, pvalue="pval", covariate="mean_abun_present", maxy=8, binwidth=0.05)

rank_scatter(res, pvalue="pval", covariate="mean_abun_present")

Let’s use ubiquity as our ind_covariate

res <- res %>% mutate(ind_covariate = ubiquity)

Also (hacky), remove test_statistic so that ash doesn’t run.

res <- res %>% 
  mutate(test_statistic = NA) %>%
  mutate(effect_size = NA)

Set up BenchDesign object

First, we’ll create an object of the BenchDesign class to hold the data and add the benchmark methods to the BenchDesign object. We’ll do this for a few values of nmids to explore its effect on the outputs.

Then, we’ll construct the SummarizedBenchmark object, which will run the functions specified in each method (these are actually sourced in from the helper scripts).

for (nmids in c(5, 50, 150)){
  print(nmids)
  bd <- initializeBenchDesign(nmids=nmids)
  
  resfile <- paste0(resdir, "baxter_summarizedBenchmark_",
                    nrow(res), "_genera_nmids_", nmids, ".RData")
  duration <- NA
  if (!file.exists(resfile)){
    t1 <- proc.time()
    sb <- bd %>% buildBench(data=res, ftCols = "ind_covariate")
                            #parallel=TRUE, BPPARAM=multicoreParam)
    metadata(sb)$data_download_link <- "https://zenodo.org/record/840333/files/crc_baxter_results.tar.gz"
    save(sb, file=resfile)
    duration <- round((proc.time()-t1)[3]/60,1)
  }else{
    load(resfile)
  }
}

Next, we’ll add the default performance metric for q-value assays. First, we have to rename the assay to ‘qvalue’.

# rename assay to qvalue
assayNames(sb) <- "qvalue"
sb <- addDefaultMetrics(sb)

Now, we’ll plot the results.

Debugging note: to look at the matrix of qvalues, run assays(sb)[["qvalue"]]

# plot nrejects by method overall and stratified by covariate
rejections_scatter( sb, as_fraction=FALSE, supplementary=FALSE)
## Warning: Removed 30 rows containing missing values (geom_point).

rejection_scatter_bins(sb, covariate="ind_covariate", supplementary=FALSE)
## Warning: Removed 120 rows containing missing values (geom_point).

#plotFDRMethodsOverlap(sb, alpha=0.1, supplementary=FALSE, order.by="freq", nsets=100 )

Note: Benjamini-Hochberg (bh) overlaps exactly with the IHW results.

plotFDRMethodsOverlap(sb, alpha=0.1, supplementary=FALSE, order.by="freq", nsets=100 )

methods <- c( "lfdr", "ihw-a10", "bl-df03", "qvalue", "bh", "bonf" )
plotCovariateBoxplots( sb, alpha=0.1, nsets=6, methods=methods)